Abstract

A method that blends tree-structured nonparametric regression
with classical maximum likelihood is used in a generalized regression
setting. The function estimates constructed are piecewise polynomials
and are produced together with decision trees containing useful
information on the regressors. Fitting is carried out by applying maximum
likelihood estimation to subsets of the data, where the subsets are
selected via recursive partitioning and cross-validation pruning.
Examples of Poisson and logistic regression trees are given to illustrate
the method applied to count and binary response data. Large-sample
properties of the estimates are derived under appropriate regularity
conditions.

Chaudhuri's research was partially supported by a grant from the Indian Statistical
Institute. Loh's research was partially supported by ARO grant DAAL03-91-G-0111.

Key words and phrases: Generalized linear models, Anscombe residual,
pseudo residual, Vapnik-Chervonenkis class, consistency.


1  Introduction:  Motivation  and  main  ideas 

Consider a general regression setup in which a real-valued response $Y$ is
related to a real or vector-valued regressor $X$ through an appropriate
probability model, which characterizes the nature of the dependence of $Y$ on $X$.
To be more specific, let us denote the conditional density or mass function
of $Y$ given $X = x$ as $f\{y|g(x)\}$, where the form of $f$ is known but $g$ is an
unknown function, which happens to be the parameter of interest here. There
are plenty of examples that arise in practice and fit into this structure. Some
well-known cases, which have received extensive attention in the literature,
are the logistic regression model (when the response $Y$ is binary, and $g(x)$
is the "logit" of the conditional probability parameter given $X = x$), the
Poisson regression model (when $Y$ is a nonnegative integer-valued random
variable with a Poisson distribution, and $g(x)$ is related to its unknown
conditional mean given $X = x$), and more generally, models that are popularly
called generalized linear models (GLM) (Nelder and Wedderburn 1972,
McCullagh and Nelder 1989), where $g$ is related to the link function. On the
other hand, $g(x)$ may be the unknown location parameter associated with
the conditional distribution of $Y$ given $X = x$. In other words, $Y$ may
satisfy the equation $Y = g(X) + \epsilon$, where the conditional distribution of
$\epsilon$ can be normal, Cauchy or exponential power (see, e.g., Box and Tiao 1973)
with center at zero.

We are interested in the situation where no finite-dimensional parametric
model is imposed on $g$, and it is assumed to be a smooth function with an
appropriate degree of smoothness. Nonparametric estimation of the
functional parameter $g$ has been explored by Cox and O'Sullivan (1990), Gu
(1990), Hastie and Tibshirani (1986, 1990), O'Sullivan, Yandell and Raynor
(1986), Staniswalis (1989), Stone (1986, 1991a), and others, who considered
various nonparametric smoothers when the conditional distribution of the
response given the regressor is assumed to have a known shape (e.g., the
conditional distribution may possess a GLM-type exponential structure).

In the case of the usual regression setup, where $Y = g(X) + \epsilon$ with
$E(\epsilon|X) = 0$, several attempts have been made to estimate $g$ by recursively
partitioning the regressor space and then constructing a regression estimate
in each partition using the method of least squares. Important developments
along this direction are AID (Sonquist 1970, Sonquist, Baker and Morgan
1973), CART (Breiman, Friedman, Olshen and Stone 1984) and SUPPORT
(Chaudhuri, Huang, Loh and Yao 1993). The purpose of this article is to
explore recursive partitioning algorithms and related likelihood-based
nonparametric function estimates in a generalized regression setting.

Two significant advantages enjoyed by recursive partitioning and
tree-structured regression are:

• the decision tree as well as the intermediate and terminal nodes created
by the partitioning algorithm may provide valuable information about
the regressors, and

• the estimates constructed in each terminal node have a simple functional
form. This permits their statistical properties to be studied and lends
insight into the nature of the relationship between the response and
the regressors within a node.

Besides, the adaptive nature of a recursive partitioning algorithm allows
varying degrees of smoothing over the regressor space so that the terminal
nodes may have variable sizes in terms of numbers of observations contained
in those nodes as well as the diameters of the sets in the regressor space to
which they correspond. The main motivation behind such adaptive variable
smoothing is to take care of heteroscedasticity as well as the possibility that
the amount of smoothness in the functional parameter $g$ may be different
in different parts of the regressor space. This is an improvement over most
of the earlier nonparametric estimation techniques in generalized regression,
which concentrated either on adaptive but fixed smoothing (i.e., using a
smoothing parameter whose value is constant over the entire regressor space)
or on deterministic smoothing.

The  general  methodology  explored  in  this  paper  consists  of  two  funda¬ 
mental  steps. 

1. Observations are recursively and adaptively divided into subsets so
that the unknown function $g$ can be satisfactorily approximated by a
simple function (e.g., a constant, a linear function or a polynomial of
suitable degree) in each subset.

2. The function $g$ is estimated from the data in each terminal node by a
polynomial using maximum likelihood. Estimates of the derivatives of
$g$ are given by the corresponding derivatives of the fitted polynomial.

The recursive partitioning algorithm used to create the terminal nodes and
the nature of the function fitted will depend on the problem. In Sections 2
and 3, we give some algorithms and examples for illustration (see also Ciampi
and Thiffault (1989)).

July 20, 1993

Adaptive recursive partitioning algorithms construct random subsets of
the regressor space which form the terminal nodes. A serious technical
barrier in studying the analytic properties of the likelihood-based function
estimates is the randomness in these subsets. Our key tool in coping with this
situation is a well-known combinatorial result in Vapnik and Chervonenkis
(1971). In Section 4, we investigate the large sample statistical properties
of the estimates that are constructed via recursive partitioning of the
regressor space followed by maximum likelihood estimation of $g$ by piecewise
polynomials. We will consider a very general setting to get good theoretical
insights into the performance of the estimates, and to derive some technical
results under mild regularity conditions.

Friedman's (1991) MARS combines spline fitting with recursive
partitioning to produce continuous function estimates. The complexity of the
estimates makes interpretation difficult and theoretical analysis of their
statistical properties extremely challenging. In the SUPPORT method of
Chaudhuri et al. (1993), a weighted averaging technique is used to combine
piecewise-polynomial fits into a smooth one. An identical technique can
be used here to create a smooth estimate from a discontinuous
piecewise-polynomial estimate without altering the asymptotic properties of the
original estimate. Friedman (1991) gives some proposals for applying MARS to
logistic regression problems, and Buja, Duffy, Hastie and Tibshirani (1991)
and Stone (1991b) comment on possible extensions and modifications of
MARS to GLM-type exponential response problems. The methodology
presented and analyzed in this article has a clear edge over all these proposals
because of its simplicity and more direct approach. It is hoped that this
will make it more appealing to users. It definitely helps in interpreting the
estimates and in studying their statistical properties.

2 Algorithms for Poisson and logistic regression trees

Algorithms for fitting Poisson and logistic regression trees are briefly
described in this section. Each algorithm has three main components, namely:

1.  A  method  to  select  the  variable  and  the  splitting  value  to  be  used  at 
a  partition. 

2.  A  method  to  determine  the  size  of  the  tree. 

3.  A  method  to  fit  a  model  to  each  terminal  node. 

There are many reasonable solutions for each component, and several of
them are described and implemented in FORTRAN 77 in Lo (1993) and
Yang (1993). In the examples in this paper, two-sample tests for means and
variances are used to find splitting variables (Huang 1989, Chaudhuri et al.
1993). CART's method of cost-complexity pruning (with cost defined as
deviance) is used to determine the size of a tree. Finally, a loglinear model
or a linear logistic regression model is fitted to each terminal node. We begin
with Poisson regression.

2.1  Poisson  regression 

The  following  sequence  of  computations  is  performed  at  each  node  t. 
1. A Poisson loglinear model is fitted to the data in $t$.

2. Let $m_i = EY_i$ and let $\hat m_i$ be its value estimated from the model. Also
let $y_i$ denote the observed value of $Y_i$. The adjusted Anscombe residual
(Pierce and Schafer 1986)

\[ r_i = \frac{y_i^{2/3} - \{\hat m_i^{2/3} - (1/9)\hat m_i^{-1/3}\}}{(2/3)\hat m_i^{1/6}} \]

is calculated for each $y_i$ in $t$. (Yang, 1993, discusses the advantages of
this residual over unadjusted Anscombe, Pearson, and deviance
residuals.)

3. Observations with nonnegative $r_i$ are classified as belonging to Group
1 and the others to Group 2.

4. Two-sample t-statistics to test for differences in means and variances
between the two groups along each covariate axis are computed. (The
latter test is Levene's (1960) test; see Chaudhuri et al. (1993).) The
rationale is that if the model fits adequately, the residuals should look
like noise and there would be little difference between the means and
variances of the two groups. Otherwise, one or more of the test
statistics may be expected to be large. This method has proven to be
effective for tree-structured classification (Loh and Vanichsetakul 1988)
and regression with censored data (Ahn and Loh 1994). Its principal
advantage over the exhaustive search strategies of AID and CART is
computational speed.

5. The covariate used to split the node is the one that possesses the most
significant t-statistic among all the tests.

6. The cut-point for the selected covariate is the average of the two group
means along that covariate. Observations with covariate values less
than or equal to the cut-point are channeled to the left subnode and
the remainder to the right subnode.

7. After an overly large tree is constructed, the nodes are pruned back
following CART's pruning method with cost-complexity defined as
residual deviance plus a constant times the number of terminal nodes of
the tree. As in CART, 10-fold cross-validation is used to determine
the constant and hence the amount of pruning.

8.  The  final  tree  is  the  one  that  has  the  smallest  cross-validation  estimate 
of  deviance. 
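
The residual and split-selection steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' FORTRAN implementation: the function names (`adjusted_anscombe`, `t_stat`, `choose_split`) are hypothetical, the Welch form of the two-sample t-statistic is an assumption (the text does not say which form is used), the Levene-type test is approximated by a t-test on absolute deviations from the group mean, and "most significant" is approximated by the largest absolute t-value rather than by a p-value.

```python
import math
from statistics import mean, variance


def adjusted_anscombe(y, m_hat):
    """Adjusted Anscombe residual for a count y with fitted Poisson mean m_hat > 0."""
    centre = m_hat ** (2 / 3) - m_hat ** (-1 / 3) / 9
    return (y ** (2 / 3) - centre) / ((2 / 3) * m_hat ** (1 / 6))


def t_stat(a, b):
    """Welch two-sample t-statistic (assumed form; returns 0.0 if both samples are constant)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def choose_split(X, residuals):
    """Pick a split covariate and cut-point from the signs of the residuals.

    X is a list of covariate rows, residuals has one entry per row.
    Returns (covariate index, cut-point).
    """
    group1 = [x for x, r in zip(X, residuals) if r >= 0]  # nonnegative residuals
    group2 = [x for x, r in zip(X, residuals) if r < 0]
    if len(group1) < 2 or len(group2) < 2:
        raise ValueError("each residual group needs at least two observations")
    best = None
    for j in range(len(X[0])):
        a = [x[j] for x in group1]
        b = [x[j] for x in group2]
        # Test for a difference in means along covariate j.
        t_mean = abs(t_stat(a, b))
        # Levene-type test for a difference in spread: t-test on absolute deviations.
        dev_a = [abs(v - mean(a)) for v in a]
        dev_b = [abs(v - mean(b)) for v in b]
        t_var = abs(t_stat(dev_a, dev_b))
        score = max(t_mean, t_var)
        if best is None or score > best[0]:
            # Cut-point: average of the two group means along covariate j.
            best = (score, j, (mean(a) + mean(b)) / 2)
    return best[1], best[2]
```

Because only two groups are compared along each covariate, the cost per node is linear in the number of observations and covariates, which is the speed advantage over exhaustive search that the text mentions.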

2.2  Logistic  regression 

Because of the 0-1 nature of the $Y$-variable in logistic regression applications,
the definition of residuals in the preceding algorithm needs to be modified as
follows. Otherwise, the algorithm is similar to that for Poisson regression.

1. The $Y$-values are first smoothed using a weighted average (similar
to the LOWESS method of Cleveland (1979)) to give a preliminary
estimate $p_i^*$ of the probability $p_i = P(Y_i = 1)$. This estimate is called
a "pseudo-observation."

2. A second estimate $\hat p_i$ of this probability from a logistic regression model
fitted to the node is obtained.

3. The "pseudo-residual," $r_i^* = (p_i^* - \hat p_i)/\hat\sigma(p_i^*)$, is computed for each
observation. Here $\hat\sigma(p_i^*)$ is an estimate of standard deviation proposed
by Fowlkes (1987), whose simulations suggest that the pseudo-residual
is approximately standard normal and independent of the fitted value
for large samples.

4. The pseudo-residual is used in place of the adjusted Anscombe residual
in the algorithm for Poisson regression trees.

Figure 1: Pruned tree with 10-fold cross-validation for Poisson example.
The true model is $\log(m) = 2\sin(x_1 - x_2) + x_2$. The loglinear models in the
terminal nodes are given by $\log(m) = f(t)$, $t = 3, 4, 5$, where $t$ denotes node
number and $f(3) = 5.117 - 1.474x_1 + 2.286x_2$, $f(4) = -0.534 + 1.789x_1 -
0.407x_2$ and $f(5) = 1.146$ + [illegible] $x_1$ + [illegible] $x_2$.

3 Numerical examples

Two examples are given in this section to illustrate the algorithms. In the
first example, 100 independent $(x_1, x_2)$ pairs are simulated, with $x_1$ and
$x_2$ independent uniformly distributed random variables over the intervals
$(0, 2\pi)$ and $(0, 2)$, respectively. For each pair, a Poisson response is generated
with mean $m$ given by

\[ \log m = 2\sin(x_1 - x_2) + x_2. \]

A  plot  of  m  versus  the  regressors  is  shown  in  Figure  2(a).  Applying  our 
Poisson  tree  algorithm  with  loglinear  fits  at  each  node  and  10-fold  cross- 
validation pruning gives a tree with three terminal nodes as shown in
Figure 1. The corresponding piecewise-loglinear estimated surface is shown in
Figure  2(b).  The  fit  is  remarkably  good,  even  though  it  is  made  up  of  three 
separate  pieces. 
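
For readers who want to regenerate data of this kind, the following sketch simulates the first example. It assumes the mean function $\log m = 2\sin(x_1 - x_2) + x_2$ with $x_1$ uniform on $(0, 2\pi)$ and $x_2$ uniform on $(0, 2)$; the function names and the seed are hypothetical, and the Poisson sampler is Knuth's multiplication method, chosen only because it needs nothing beyond the standard library. The simulated values will not match those behind the figures.

```python
import math
import random


def rpois(mean, rng):
    """Draw one Poisson variate by Knuth's multiplication method (fine for moderate means)."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate_poisson_example(n=100, seed=0):
    """Return n triples (x1, x2, y) from the assumed Poisson model of the first example."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.uniform(0.0, 2.0 * math.pi)
        x2 = rng.uniform(0.0, 2.0)
        m = math.exp(2.0 * math.sin(x1 - x2) + x2)  # mean on the original scale
        data.append((x1, x2, rpois(m, rng)))
    return data
```

Under this model the mean ranges from about $e^{-2}$ to $e^{4}$, so the counts vary over two orders of magnitude across the regressor space.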

For the second example, we simulate 300 independent observation
vectors $(Y_i, X_{i1}, X_{i2})$, $i = 1, \ldots, 300$, where $X_{i1}$ and $X_{i2}$ are uniformly and
independently distributed on the square $(-1.5, 1.5) \times (-1.5, 1.5)$, and $Y_i$ is
Bernoulli with probability $p_i = P(Y_i = 1)$ given by

\[ \log\{p_i/(1 - p_i)\} = -x_{i1} + \sin(\pi x_{i2}). \]

A plot of $p_i$ versus $x_{i1}$ and $x_{i2}$ is shown in Figure 3. Figure 4 shows a tree
with six terminal nodes constructed by our logistic regression tree algorithm.
The fitted functions at the terminal nodes are $\log\{p/(1 - p)\} = f(t)$, where
$t$ denotes the node number and

f(4) = 1.391 - 0.492 x_1 + 1.477 x_2,

f(6) = -0.184 - 0.076 x_1 - 0.002 x_2,

f(8) = 0.706 + 0.962 x_1 - 3.198 x_2,

f(9) = -6.420 - 0.855 x_1 + 4.998 x_2,

f(10) = -3.218 + 0.14[illegible] x_1 - [illegible] x_2,

f(11) = 1.279 + 1.399 x_1 - 4.311 x_2.

The unsmoothed and smoothed function estimates are plotted in Figures 5
and 6, respectively. The smoothing is achieved by weighted averaging using
trapezoidal weights (see Lo (1993) for details).

Figure 6: Smoothed estimate of the function for logistic regression example.

4 Statistical properties of estimates: Some technical results

Assume that $(Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n)$ are independent data points,
where the response $Y_i$ is real-valued and the regressor $X_i$ is $d$-dimensional.
As before, let $f\{y_i|g(x_i)\}$ be the conditional pdf/pmf of $Y_i$ given $X_i = x_i$.
We wish to estimate the function $g$ over a compact set $C \subset R^d$. Let
$T_n$ be a random partition of $C$ (i.e., $C = \bigcup_{t \in T_n} t$), which is generated by
some adaptive recursive partitioning algorithm applied to the data, and it
is assumed to consist of polyhedrons having at most $M$ (a fixed positive
integer) faces. We will denote the diameter of a set $t \in T_n$ by $\delta(t)$ (i.e.,
$\delta(t) = \sup_{x, y \in t} |x - y|$), which will be assumed to be positive for each set
$t \in T_n$. For $t \in T_n$, $\bar X_t$ will denote the average of the $X_i$'s that belong to $t$.
Also, assuming that the function $g$ is $m$-th order differentiable ($m \ge 0$), let
us write its Taylor expansion around $\bar X_t$ as

\[ g(x) = \sum_{u \in U} (u!)^{-1} D^u g(\bar X_t)(x - \bar X_t)^u + r_t(x, \bar X_t). \]

Here $U = \{u \mid u = (v_1, v_2, \ldots, v_d), [u] \le m\}$, where $[u] = v_1 + v_2 + \cdots + v_d$
and the $v_i$'s are nonnegative integers. For $u \in U$, $D^u$ is the mixed partial
differential operator with index $u$, $u! = \prod_{i=1}^d v_i!$, and for $x = (x_1, x_2, \ldots, x_d)$,
$x^u = \prod_{i=1}^d x_i^{v_i}$ (with the convention that $0! = 1$ and $0^0 = 1$). We impose the
following condition (cf. Condition (a) in Chaudhuri et al. (1993)) concerning
the behavior of the remainder term in the above Taylor expansion.

Condition 1 $\max_{t \in T_n} \sup_{x \in t} \{\delta(t)\}^{-m} |r_t(x, \bar X_t)| \xrightarrow{P} 0$ as $n \to \infty$.

Observe that if $g$ is continuously differentiable with derivatives up to
order $m$ on an open set in $R^d$ that contains the compact set $C$ and the
diameters of the sets in $T_n$ shrink (i.e., if $\sup_{t \in T_n} \delta(t) \to 0$) in probability as
$n \to \infty$ (cf. Condition (12.9) in Breiman et al. (1984)), the above condition
automatically holds. However, even if some of the sets in $T_n$ do not shrink
as $n$ grows, Condition 1 may still be true. In any case, Condition 1 implies
that the function $g$ can be uniformly well approximated by polynomials of
degree smaller than or equal to $m$ on each of the sets in $T_n$ when $n$ is large.
For $\Theta = (\theta_u)_{u \in U}$, let us define the polynomial $P(x, \Theta, \bar X_t)$ in $x$ as

\[ P(x, \Theta, \bar X_t) = \sum_{u \in U} \theta_u (u!)^{-1} \{\delta(t)\}^{-[u]} (x - \bar X_t)^u. \]

Following the estimation procedure described in the previous sections, let
$\hat\Theta_t$ be the estimate obtained by applying the maximum likelihood technique
to the data points $(Y_i, X_i)$ for which $X_i \in t$. In other words,

\[ \hat\Theta_t = \arg\max_{\Theta} \prod_{X_i \in t} f\{Y_i \mid P(X_i, \Theta, \bar X_t)\}. \]

We will now state a couple of conditions concerning the distribution of the
$X_i$'s in the regressor space. For $X_i \in t$, let $\Gamma_i$ be the $s(U)$-dimensional
column vector with components given by $(u!)^{-1}\{\delta(t)\}^{-[u]}(X_i - \bar X_t)^u$, where
$u \in U$. Here $s(U)$ is the size of the finite set $U$, which is defined earlier.
Next, denote by $D_t$ the $s(U) \times s(U)$ matrix defined as $\sum_{X_i \in t} \Gamma_i \Gamma_i^T$, where $T$
indicates transpose.
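
To make the index set and the scaled monomial vector concrete, the following sketch enumerates all multi-indices of total degree at most $m$ in $d$ variables and evaluates the vector with components $(u!)^{-1}\{\delta(t)\}^{-[u]}(x - \bar X_t)^u$ at a point. The function names `multi_indices` and `gamma_vector` are hypothetical, and the ordering of the components is an arbitrary choice that the text leaves open.

```python
from itertools import product
from math import comb, factorial, prod


def multi_indices(d, m):
    """All d-tuples of nonnegative integers whose entries sum to at most m."""
    return [u for u in product(range(m + 1), repeat=d) if sum(u) <= m]


def gamma_vector(x, center, delta, m):
    """Scaled monomials (u!)^{-1} delta^{-[u]} (x - center)^u over all multi-indices u."""
    vec = []
    for u in multi_indices(len(x), m):
        monomial = prod((xi - ci) ** ui for xi, ci, ui in zip(x, center, u))
        u_factorial = prod(factorial(ui) for ui in u)
        vec.append(monomial / (u_factorial * delta ** sum(u)))
    return vec
```

For example, with $d = 2$ and $m = 1$ the index set is $\{(0,0), (0,1), (1,0)\}$, so the node model is an ordinary linear fit with three coefficients. In general the number of coefficients is $\binom{m+d}{d}$, which `comb(m + d, d)` computes. Dividing by $\{\delta(t)\}^{[u]}$ keeps every component of the vector bounded by 1 in absolute value for points inside the node, whatever the size of the node.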

Condition 2 Let $N_t$ = the number of $X_i$'s that belong to $t$, and $N_n =
\min_{t \in T_n} \{\delta(t)\}^{2m} N_t$. Then $N_n/\log n \xrightarrow{P} \infty$ as $n \to \infty$.

Condition 3 Let $\lambda_t$ be the smallest eigenvalue of $N_t^{-1} D_t$ and let $\lambda_n =
\min_{t \in T_n} \lambda_t$. Then $\lambda_n$ remains bounded away from zero in probability as
$n \to \infty$.

Clearly, Condition 2 ensures that there will be sufficiently many
observations in each of the sets in $T_n$ (cf. Condition (12.8) in Breiman et al. (1984)
and Condition (b) in Chaudhuri et al. (1993)). Condition 3, on the other
hand, guarantees that for large sample size, each of the matrices $D_t$ will be
nonsingular and nicely behaved (cf. Condition (c) in Chaudhuri et al. (1993))
with a high probability. In a sense, it ensures regularity in the behavior of
the Fisher information matrix associated with the finite-dimensional model
fitted to the conditional distribution within each set in $T_n$. Note that we are
fitting a polynomial of a fixed degree with a finite number of coefficients to
the data points corresponding to any set in $T_n$.

Finally, we will state a Cramer-type regularity condition on the
conditional distribution of the response given the regressor. This condition is
absolutely crucial in establishing desirable asymptotic behavior of our
estimates, which are constructed via the maximum likelihood technique.

Condition 4 Let us view the pdf/pmf $f(y|s)$ as a function of two
variables so that $s$ becomes a real-valued parameter varying in a bounded open
interval $J$. Here $J$ is such that as $x$ varies over some open set
containing $C$, $g(x)$ takes its values in $J$. The support of $f(y|s)$ for any given
$s \in J$ is the same, and it does not depend on $s$. Also, $\log\{f(y|s)\}$ is three
times continuously differentiable w.r.t. $s$ for any given value of $y$, and let
$A(y|s)$, $B(y|s)$ and $H(y|s)$ be the first, second and third derivatives,
respectively, of $\log\{f(y|s)\}$ w.r.t. $s$. The random variable $A(Y|s)$ has zero mean,
and the mean of $B(Y|s)$ is negative and stays away from zero as $s$ varies
in $J$. Here $Y$ has pdf/pmf $f(y|s)$, and there exists a nonnegative function
$K(y)$ which dominates each of $A(y|s)$, $B(y|s)$ and $H(y|s)$ for all values of
$s \in J$ (i.e., $|A(y|s)| \le K(y)$, $|B(y|s)| \le K(y)$ and $|H(y|s)| \le K(y)$).
Further, if $M(w, s)$ denotes the moment generating function of $K(Y)$ defined
as $M(w, s) = E[\exp\{wK(Y)\}]$ with $Y$ having pdf/pmf $f(y|s)$, $M(w, s)$
remains bounded as $w$ varies over an open interval around the origin and $s$
varies over $J$.

It is appropriate to note here that Condition 4 is trivially satisfied when
the response $Y$ is binary in nature, or more generally, when its conditional
distribution given the regressor is binomial, and $s$ is the logit of the
probability parameter such that the probability remains bounded away from 0
and 1. As a matter of fact, this condition will hold whenever the conditional
distribution of the response belongs to a standard exponential family (e.g.,
binomial, Poisson, exponential, gamma, normal, etc.), and $s$ is the natural
parameter taking values in a bounded interval. If $f(y|s)$ happens to be a
location model with $s$ behaving like a location parameter varying over a
bounded parameter space, Condition 4 remains true for several important
cases like the Cauchy or an exponential power distribution (see, e.g., Box and
Tiao (1973)). In a sense, this condition can be viewed as an extension of
Condition (12.12) in Breiman et al. (1984) and Condition (d) in Chaudhuri
et al. (1993).
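
As a brief worked check of the exponential-family claim, consider the Poisson case with natural parameter $s = \log m$ ranging over a bounded interval $J = (a, b)$. The interval endpoints $a$, $b$ and the particular dominating function $K$ below are choices made for this illustration, not taken from the text.

```latex
\begin{align*}
\log f(y|s) &= ys - e^{s} - \log y!, \qquad y = 0, 1, 2, \ldots, \\
A(y|s) &= y - e^{s}, \qquad B(y|s) = -e^{s}, \qquad H(y|s) = -e^{s}, \\
E\{A(Y|s)\} &= E(Y) - e^{s} = 0, \qquad E\{B(Y|s)\} = -e^{s} \le -e^{a} < 0, \\
K(y) &= y + e^{b} \ \text{ dominates } |A(y|s)|,\ |B(y|s)|,\ |H(y|s)| \ \text{ for all } s \in J, \\
M(w, s) &= E[\exp\{w(Y + e^{b})\}] = \exp\{w e^{b} + e^{s}(e^{w} - 1)\}
        \le \exp\{|w| e^{b} + e^{b}(e^{|w|} - 1)\}.
\end{align*}
```

The support does not depend on $s$, and the last bound is finite and free of $s$ for $w$ in any bounded interval around the origin, so each requirement of the condition holds in this case. The boundedness of $J$ is what keeps the mean of $B$ away from zero and the dominating function finite.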

Theorem 1 Suppose that Conditions 1 through 4 hold. Then there is a
choice of the maximum likelihood estimate $\hat\Theta_t$ (possibly a local maximizer of
the likelihood) for every $t \in T_n$ such that given any $u \in U$,

\[ \max_{t \in T_n} \sup_{x \in t} |D^u P(x, \hat\Theta_t, \bar X_t) - D^u g(x)| \xrightarrow{P} 0 \quad \text{as } n \to \infty. \]

The above theorem guarantees that there exists a choice of the maximum
likelihood estimate $\hat\Theta_t$ for each $t \in T_n$ so that the resulting piecewise
polynomial estimates of the function $g$ and its derivatives are all consistent. Now,
it can very well happen that the estimate $\hat\Theta_t$ is only a local maximizer of the
likelihood instead of being a global maximizer. For instance, the likelihood
based on the data points corresponding to a set in $T_n$ may have multiple
maxima. However, when the conditional distribution of the response given
the regressor belongs to a standard exponential family, strict concavity of
the loglikelihood guarantees uniqueness of the maximum likelihood estimate
in large samples. In the special case when we fit a constant (i.e., a
polynomial of degree zero) to the data points corresponding to each set in $T_n$ using
the maximum likelihood approach, Theorem 1 gives a useful generalization
of the consistency result that holds for piecewise constant tree-structured
regression estimates discussed in Breiman et al. (1984). The piecewise
polynomial estimates of $g$ and its derivatives are not continuous everywhere in
the regressor space. Smooth estimates, which can be constructed by
combining the polynomial pieces by means of smooth weighted averaging, will
be consistent provided the weight functions are chosen properly. Theorem 2
in Chaudhuri et al. (1993) describes a way of constructing families of smooth
weight functions that will give smooth and consistent estimates of $g$ and its
derivatives.


5  Appendix:  The  proofs 


We begin by proving some preliminary results that will be used in the proof
of Theorem 1. Unless stated otherwise, all vectors are assumed to be column
vectors and a superscript $T$ denotes transpose.

Lemma 1 Under Conditions 1, 2 and 4, we have

\[ \max_{t \in T_n} N_t^{-1}\{\delta(t)\}^{-m} \Big| \sum_{X_i \in t} [A\{Y_i \mid P(X_i, \Theta_t^*, \bar X_t)\}]\,\Gamma_i \Big| \xrightarrow{P} 0 \quad \text{as } n \to \infty. \]

Here $\Theta_t^*$ is the $s(U)$-dimensional vector with a typical entry $\{\delta(t)\}^{[u]} D^u g(\bar X_t)$,
where $u \in U$. In other words, $P(x, \Theta_t^*, \bar X_t)$ is nothing but the Taylor
polynomial of $g(x)$ expanded around $\bar X_t$.

Proof. First observe that a straightforward application of the mean
value theorem of differential calculus yields the following:

\begin{align*}
N_t^{-1}\{\delta(t)\}^{-m} \sum_{X_i \in t} [A\{Y_i \mid P(X_i, \Theta_t^*, \bar X_t)\}]\,\Gamma_i
&= N_t^{-1}\{\delta(t)\}^{-m} \sum_{X_i \in t} [A\{Y_i \mid g(X_i)\}]\,\Gamma_i \\
&\quad - N_t^{-1}\{\delta(t)\}^{-m} \sum_{X_i \in t} \{r_t(X_i, \bar X_t) B(Y_i \mid Z_i)\}\,\Gamma_i, \tag{1}
\end{align*}

where $Z_i$ is a random variable that lies between $P(X_i, \Theta_t^*, \bar X_t)$ and $g(X_i)$. In
view of Condition 4, the conditional mean of $A\{Y|g(X)\}$ given $X = x$ is zero,
and if we denote its conditional moment generating function by $M_1(w|x)$,
there exist constants $k_1 > 0$ and $\rho_1 > 0$ such that $M_1(w|x) \le 2\exp(k_1 w^2/2)$
for all $x \in C$ and $0 \le w \le \rho_1$ (see the arguments at the beginning of Lemma
12.27 in Breiman et al. (1984)). At this point, pretend that $t$ is a fixed
non-random polyhedron in $R^d$, all the data points $X_i$'s that fall in $t$ form a
collection of deterministic points in $C$, and the corresponding $A\{Y_i|g(X_i)\}$'s
form a set of independent random variables such that the distribution of
$A\{Y_i|g(X_i)\}$ is the same as the conditional distribution of it given $X_i$ in
the original problem. Note that $\Gamma_i$ is an $s(U)$-dimensional vector with each
component bounded in absolute value by 1. The arguments used in handling
the "variance term" in the proof of Theorem 1 in Chaudhuri et al. (1993)
imply that there exist constants $k_2 > 0$, $k_3 > 0$ and $\rho_2 > 0$ (which depend
only on the compact set $C$, the integer $s(U)$ and the constants $k_1$, $\rho_1$) such
that

\begin{align*}
\Pr\Big( \{\delta(t)\}^{-m} N_t^{-1} \Big| \sum_{X_i \in t} [A\{Y_i \mid g(X_i)\}]\,\Gamma_i \Big| > \rho \Big)
&\le k_2 \exp[-k_3 \{\delta(t)\}^{2m} N_t \rho^2] \\
&\le k_2 \exp(-k_3 N_n \rho^2),
\end{align*}

whenever $\rho \le \rho_2$. Observe that the first inequality above is a consequence
of Lemma 12.26 in Breiman et al. (1984), which can be applied to each
real-valued component of the $s(U)$-dimensional vector that appears here. Recall
at this point that each set in $T_n$ is a polyhedron in $R^d$ having at most $M$
faces. The fundamental combinatorial result of Vapnik and Chervonenkis
(1971) (Dudley 1978, Section 7) now implies that there exists a collection
$\mathcal{C}$ of subsets of the set $\{X_1, X_2, \ldots, X_n\}$ such that $\#(\mathcal{C}) \le (2n)^{M(d+2)}$, and
for any polyhedron $t$ with at most $M$ faces, there is a set $t^* \in \mathcal{C}$ with the
property that $X_i \in t$ if and only if $X_i \in t^*$. Hence, even for a collection
like $T_n$ consisting of random polyhedrons generated by an adaptive recursive
partitioning algorithm, we must have the following exponential bound for
the conditional probability given the $X_i$'s and $T_n$ (i.e., after the sets in $T_n$
are specified):

\[ \Pr\Big( \max_{t \in T_n} \{\delta(t)\}^{-m} N_t^{-1} \Big| \sum_{X_i \in t} [A\{Y_i \mid g(X_i)\}]\,\Gamma_i \Big| > \rho \,\Big|\, X_1, X_2, \ldots, X_n, T_n \Big) \le k_2 (2n)^{M(d+2)} \exp(-k_3 N_n \rho^2). \]

It now follows from Condition 2 that

\[ \max_{t \in T_n} \{\delta(t)\}^{-m} N_t^{-1} \Big| \sum_{X_i \in t} [A\{Y_i \mid g(X_i)\}]\,\Gamma_i \Big| \xrightarrow{P} 0 \quad \text{as } n \to \infty. \]
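
The step from the exponential bound to convergence in probability can be spelled out as follows. This is a short sketch that assumes the bound has the form $k_2 (2n)^{M(d+2)} \exp(-k_3 N_n \rho^2)$; the constant $L$ is introduced here only for the argument.

```latex
\begin{align*}
\text{On the event } \{N_n \ge L \log n\}: \quad
k_2 (2n)^{M(d+2)} \exp(-k_3 N_n \rho^2)
  &\le k_2 (2n)^{M(d+2)}\, n^{-k_3 \rho^2 L} \\
  &= k_2\, 2^{M(d+2)}\, n^{M(d+2) - k_3 \rho^2 L}
  \;\longrightarrow\; 0
  \quad \text{if } L > \frac{M(d+2)}{k_3 \rho^2}.
\end{align*}
```

Condition 2 states that $N_n/\log n$ tends to infinity in probability, so for every fixed $L$ the event $\{N_n \ge L \log n\}$ has probability tending to one. Taking expectations of the conditional bound over this event and its complement then shows that the unconditional probability tends to zero for each fixed $\rho \le \rho_2$, and hence for every $\rho > 0$ by monotonicity. This is where the logarithmic rate in Condition 2 comes from: it has to beat the polynomial growth $(2n)^{M(d+2)}$ in the number of distinguishable polyhedra.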
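For the second term on the right of (1), we have using Conditions 1, 2
and 4

\begin{align*}
\max_{t \in T_n} N_t^{-1}\{\delta(t)\}^{-m} \Big| \sum_{X_i \in t} \{r_t(X_i, \bar X_t) B(Y_i \mid Z_i)\}\,\Gamma_i \Big|
&\le \Big[ \max_{t \in T_n} \{\delta(t)\}^{-m} \sup_{x \in t} |r_t(x, \bar X_t)| \Big]
   \Big[ \max_{t \in T_n} N_t^{-1} \sum_{X_i \in t} K(Y_i)\,|\Gamma_i| \Big] \\
&\xrightarrow{P} 0 \quad \text{as } n \to \infty.
\end{align*}

Note that we are using the fact that $\max_{t \in T_n} N_t^{-1} \sum_{X_i \in t} K(Y_i)|\Gamma_i|$ remains
bounded in probability as $n \to \infty$ in view of the boundedness of the vectors
$\Gamma_i$'s and Conditions 2 and 4. In fact, if $\psi(x)$ denotes the conditional mean
of $K(Y)$ given $X = x$, arguments identical to those used in handling the
first term on the right of (1) yield

\[ \max_{t \in T_n} N_t^{-1} \Big| \sum_{X_i \in t} \{K(Y_i) - \psi(X_i)\}\,|\Gamma_i| \Big| \xrightarrow{P} 0 \quad \text{as } n \to \infty. \]

This completes the proof of the lemma.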


Lemma 2. Consider the $s(A) \times s(A)$ matrix

$$-N_t^{-1} \sum_{X_i \in t} [B\{Y_i \mid P(X_i, \Theta_t^*, X_t)\}]\, \Gamma_i \Gamma_i^T$$

and let $\gamma(t)$ denote its smallest eigenvalue. Define $\gamma_n = \min_{t \in T_n} \gamma(t)$. Then,
under Conditions 1 through 4, $\gamma_n$ remains positive and bounded away from
zero in probability as $n \to \infty$.
Proof: Once again, let us write using the mean value theorem of differential
calculus

$$\begin{aligned}
-N_t^{-1} \sum_{X_i \in t} [B\{Y_i \mid P(X_i, \Theta_t^*, X_t)\}]\, \Gamma_i \Gamma_i^T
={}& -N_t^{-1} \sum_{X_i \in t} \psi(X_i)\, \Gamma_i \Gamma_i^T
 - N_t^{-1} \sum_{X_i \in t} [B\{Y_i \mid g(X_i)\} - \psi(X_i)]\, \Gamma_i \Gamma_i^T \\
&+ N_t^{-1} \sum_{X_i \in t} \{r_t(X_i, X_t)\, H(Y_i \mid V_i)\}\, \Gamma_i \Gamma_i^T, \qquad (2)
\end{aligned}$$

where $\psi(x)$ is the conditional mean of $B\{Y \mid g(X)\}$ given $X = x$, and $V_i$ is
a random variable that falls between $g(X_i)$ and $P(X_i, \Theta_t^*, X_t)$. Now it is
obvious from Conditions 3 and 4 that if $\eta_n = \min_{t \in T_n} \eta(t)$, where $\eta(t)$ is the
smallest eigenvalue of the matrix $-N_t^{-1} \sum_{X_i \in t} \psi(X_i)\, \Gamma_i \Gamma_i^T$, then $\eta_n$ remains
positive and bounded away from zero in probability as $n \to \infty$. On the
other hand, the term on the right of (2) that involves $B\{Y_i \mid g(X_i)\} - \psi(X_i)$
can be handled in the same way as the first term on the right of (1) in the
proof of Lemma 1 to yield

$$\max_{t \in T_n} \left|N_t^{-1} \sum_{X_i \in t} [B\{Y_i \mid g(X_i)\} - \psi(X_i)]\, \Gamma_i \Gamma_i^T\right| \stackrel{P}{\to} 0 \quad \text{as } n \to \infty.$$

Note that the arguments, which exploit Conditions 2 and 4 and were applied
to each component of the $s(A)$-dimensional vector appearing as the first term
on the right of (1), can be easily modified for each entry of the $s(A) \times s(A)$
matrix here. Finally, using Conditions 1, 2 and 4, and arguments that are
virtually the same as those employed to treat the second term on the right
of (1) in the proof of Lemma 1, we obtain the following result for the third
term on the right of (2):

$$\max_{t \in T_n} \left|N_t^{-1} \sum_{X_i \in t} \{r_t(X_i, X_t)\, H(Y_i \mid V_i)\}\, \Gamma_i \Gamma_i^T\right|
\le \left[\max_{t \in T_n} \sup_{x \in t} |r_t(x, X_t)|\right] \left[\max_{t \in T_n} N_t^{-1} \sum_{X_i \in t} K(Y_i)\,|\Gamma_i \Gamma_i^T|\right]
\stackrel{P}{\to} 0 \quad \text{as } n \to \infty.$$


This  completes  the  proof  of  the  lemma. 
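The eigenvalue condition used in this proof can be illustrated in the simplest setting. The sketch below is only an illustration under stated assumptions: a one-dimensional regressor, a linear local fit so that each $\Gamma_i$ is taken to be $(1, (X_i - X_t)/\delta(t))$, and the weight $\psi(X_i)$ dropped. It computes the smallest eigenvalue of $N_t^{-1} \sum \Gamma_i \Gamma_i^T$ in closed form. The function names are hypothetical.

```python
import math
import random


def gamma_vectors(xs):
    """Return Gamma_i = (1, (x_i - xbar) / delta) for the points in one node.

    xbar is the node average and delta the node diameter (assumed > 0).
    """
    xbar = sum(xs) / len(xs)
    delta = max(xs) - min(xs)
    return [(1.0, (x - xbar) / delta) for x in xs]


def smallest_eigenvalue(gammas):
    """Smallest eigenvalue of (1/N) * sum_i Gamma_i Gamma_i^T (2 x 2, symmetric)."""
    n = len(gammas)
    a = sum(g[0] * g[0] for g in gammas) / n
    b = sum(g[0] * g[1] for g in gammas) / n
    c = sum(g[1] * g[1] for g in gammas) / n
    # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]].
    return 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b ** 2)


if __name__ == "__main__":
    rng = random.Random(0)
    wide = [rng.random() for _ in range(2000)]   # node of diameter about 1
    narrow = [0.4 + 0.01 * x for x in wide]      # same shape, diameter about 0.01
    print(smallest_eigenvalue(gamma_vectors(wide)))
    print(smallest_eigenvalue(gamma_vectors(narrow)))
```

Because each $\Gamma_i$ is scaled by the node diameter, the two nodes give essentially the same smallest eigenvalue, which is why a bound of this kind can hold uniformly over shrinking sets.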

Proof of Theorem 1: First note that the assertion in the Theorem will
follow if we can show that there exist choices for the maximum likelihood
estimates $\hat{\Theta}_t$'s such that

$$\max_{t \in T_n} \{\delta(t)\}^{-m}\, |\hat{\Theta}_t - \Theta_t^*| \stackrel{P}{\to} 0 \quad \text{as } n \to \infty.$$

For $t \in T_n$, let $l_t(\Theta)$ denote the loglikelihood based on the observations
such that $X_i \in t$. That is, $l_t(\Theta) = \sum_{X_i \in t} \log f\{Y_i \mid P(X_i, \Theta, X_t)\}$.

For $\rho > 0$, define the event $E_t(\rho)$ as follows: $l_t(\Theta)$ is a concave
(it can be locally concave) function in a neighborhood of $\Theta_t^*$ with radius
$\{\delta(t)\}^m \rho$ (i.e., for $\Theta$ satisfying $\{\delta(t)\}^{-m} |\Theta - \Theta_t^*| \le \rho$), and it has a maximum
(which can be a local maximum if $l_t(\Theta)$ has several maxima) in the interior
of this neighborhood. Note that the occurrence of this event implies that
the maximum likelihood equation obtained by differentiating $l_t(\Theta)$ w.r.t.
$\Theta$ will have a root $\hat{\Theta}_t$ such that $\{\delta(t)\}^{-m} |\hat{\Theta}_t - \Theta_t^*| < \rho$. Now a Taylor
expansion of $l_t(\Theta)$ around $\Theta_t^*$ yields


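$$\begin{aligned}
l_t(\Theta) ={}& l_t(\Theta_t^*) + \sum_{X_i \in t} (\Theta - \Theta_t^*)^T \Gamma_i\, A\{Y_i \mid P(X_i, \Theta_t^*, X_t)\} \\
&+ (1/2) \sum_{X_i \in t} (\Theta - \Theta_t^*)^T \Gamma_i \Gamma_i^T\, B\{Y_i \mid P(X_i, \Theta_t^*, X_t)\}\, (\Theta - \Theta_t^*) \\
&+ (1/6) \sum_{X_i \in t} \{(\Theta - \Theta_t^*)^T \Gamma_i\}^3\, H(Y_i \mid W_i), \qquad (3)
\end{aligned}$$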

where $W_i$ is a random variable lying between $P(X_i, \Theta_t^*, X_t)$ and $P(X_i, \Theta, X_t)$.
For the third sum on the right of (3), recall that the $\Gamma_i$'s are bounded
vectors. Also, for $\Theta$ in a sufficiently small neighborhood of $\Theta_t^*$, we have
$|H(Y_i \mid W_i)| \le K(Y_i)$ in view of Condition 4. It now follows

from Lemmas 1 and 2 and some of the arguments used in their proofs that
there exists $\rho_3 > 0$ such that whenever $\rho < \rho_3$, we must have

$$\Pr\left(\bigcap_{t \in T_n} E_t(\rho)\right) \to 1 \quad \text{as } n \to \infty.$$

The  proof  of  the  theorem  is  now  complete. 
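The step from a concave loglikelihood to a root of the likelihood equation can be made concrete in one special case. The sketch below is an illustration under assumptions and not the fitting procedure of the paper: it takes a single node with a one-dimensional scaled regressor $z$, a Poisson response with log mean $\theta_0 + \theta_1 z$, and simulated data. It solves the likelihood equation by Newton's method. In this case the loglikelihood is concave in $\Theta$ everywhere, so the root is its maximizer. All function names are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate by Knuth's multiplication method (small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_node(rng, n, theta):
    """Simulate n pairs (z, y) with z uniform on [-0.5, 0.5], log mean theta0 + theta1*z."""
    zs = [rng.random() - 0.5 for _ in range(n)]
    ys = [poisson_draw(rng, math.exp(theta[0] + theta[1] * z)) for z in zs]
    return zs, ys


def score_and_info(theta, zs, ys):
    """Score vector and negative Hessian of the Poisson loglikelihood."""
    s0 = s1 = h00 = h01 = h11 = 0.0
    for z, y in zip(zs, ys):
        mu = math.exp(theta[0] + theta[1] * z)
        s0 += y - mu
        s1 += (y - mu) * z
        h00 += mu
        h01 += mu * z
        h11 += mu * z * z
    return (s0, s1), (h00, h01, h11)


def newton_mle(zs, ys, steps=25):
    """Solve the likelihood equation by Newton's method, starting at a constant fit."""
    theta = [math.log(sum(ys) / len(ys)), 0.0]
    for _ in range(steps):
        (s0, s1), (h00, h01, h11) = score_and_info(theta, zs, ys)
        det = h00 * h11 - h01 * h01
        # Newton step: theta <- theta + (negative Hessian)^{-1} * score.
        theta[0] += (h11 * s0 - h01 * s1) / det
        theta[1] += (h00 * s1 - h01 * s0) / det
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    zs, ys = simulate_node(rng, 2000, (1.0, 0.5))
    print(newton_mle(zs, ys))
```

The fixed number of Newton steps is a simplification; a practical implementation would stop when the score is sufficiently close to zero.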